InfoVis 2003 Contest - InfoZoom Entry

Michael Spenke, Christian Beilken

{Michael.Spenke,Christian.Beilken}@fit.fraunhofer.de

FIT - Fraunhofer Institute for Applied Information Technology

See the InfoVis 2003 Contest rules and task at http://www.cs.umd.edu/hcil/iv03contest/

Ratings used below: (Strength,Possible,Difficult,Not Available)

Pairwise comparisons of trees: Topological changes

Did anything change, in general, or in a subtree?

Rating:: Strength
Process:: A simple side-by-side comparison of two or more trees gives a first impression of the differences.; Marking a cell in one tree also marks it in the other tree.; This makes it easy to compare the size of corresponding cells. (First image); In order to compare two or more trees precisely we first need a mapping between the nodes of the trees,; which defines when two nodes in different trees are regarded as identical.; In the animal classification trees each animal has a Latin name which is unique within a tree.; Therefore, we consider two animals in different trees as identical if and only if they have the same Latin name.; Using the derived attribute Count(Tree) per Latin Name we can exactly determine which animals are found in both trees
and which are only contained in one of the trees.; This is explained in detail further below in the answer to the application specific question; "To what extent are the differences in the classifications due to differences in how animals are thought to be related?".; The attribute Latin Path contains the full path name of each animal.; Therefore, the derived attribute Count(Latin Path) per Latin Name can be used to find all animals that are differently classified in the two trees.; This is also explained in detail further below in the answer to the question mentioned above.; In the file system logs there is no unique identification of a file besides the full path name.; Consequently, it is impossible to find files that were renamed or moved to another directory.; We can only see that some files are missing in a later snapshot and new files have appeared.; These files can be exactly determined using the derived attributes List(week) per file path and; Count(week) per file path. We simply zoom on all files which do not appear in all 5 snapshots.; (Second, third, and fourth image); Another approach to find the differences between the snapshots is based on the creation and modification times of the files.; This is explained in detail further below in the answer to the application specific question; "Were there a lot of pages created recently? If so, in which part of the file system?"

Image:: The two animal trees side by side; Most files are found in all 5 snapshots.; Some files are found in only 4 or fewer snapshots.; Overview of the files found in only 4 or fewer snapshots.
Answer:

4701 files do not appear in all 5 snapshots.
Most of them are found in the toplevel directories users, usersrschulz, and class, especially in class/spring2003.
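
The derived attributes List(week) per file path and Count(week) per file path can be imitated outside InfoZoom. The following is only a minimal sketch with hypothetical records (the paths and weeks are made up for illustration); it is not InfoZoom's implementation.

```python
from collections import defaultdict

# Hypothetical (week, file path) records; the real table has one record
# per file and snapshot.
records = [
    ("A", "/index.html"), ("B", "/index.html"), ("C", "/index.html"),
    ("D", "/index.html"), ("E", "/index.html"),
    ("A", "/class/old.html"), ("B", "/class/old.html"),
    ("D", "/class/spring2003/new.html"), ("E", "/class/spring2003/new.html"),
]
ALL_WEEKS = 5  # snapshots A to E


def weeks_per_path(rows):
    """Return {file path: sorted weeks}, i.e. List(week) per file path."""
    weeks = defaultdict(set)
    for week, path in rows:
        weeks[path].add(week)
    return {path: sorted(found) for path, found in weeks.items()}


listed = weeks_per_path(records)
# Count(week) per file path is the length of each list; files with a
# count below 5 are missing from at least one snapshot.
incomplete = {p: ws for p, ws in listed.items() if len(ws) < ALL_WEEKS}
print(incomplete)
```

Because files are identified only by their full path, a renamed file shows up here as one path that disappears and another that appears.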

What nodes were added, deleted?

Rating:: Strength
Process:: We can exactly determine the nodes that are contained in one tree but not in the other.; For details see previous question.
Image:: See previous question
Answer:: See previous question

Did any node or subtrees "move" in the tree? Can you characterize those movements?

Rating:: Difficult
Process:: Using the derived attribute Count(Latin Path) per Latin Name,
we can exactly determine the animals that are contained in both trees but with a different classification.; For details see first question.

It is not, however, possible to automatically decide if the differences found are the result of the move of a complete subtree.; Some manual browsing is necessary here.
Image:: See first question
Answer:: See first question

Pairwise comparisons of trees: Attribute value changes

Global impression: did things change a lot or not?

Rating:: Strength
Process:: We define the derived attributes Average(hitCount) per week and Average(size in KB) per week and sort by week.
Image:
Answer:

The average file size increases from week to week.
The average hit count increases after week B.
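
A derived attribute such as Average(hitCount) per week is a grouped average. The sketch below uses hypothetical rows and assumes that missing values are skipped; how InfoZoom treats missing values is not stated here.

```python
from collections import defaultdict


def average_per_group(rows, group_key, value_key):
    """Average value_key per distinct group_key, skipping missing values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        value = row[value_key]
        if value is None:  # assumption: unknown values are ignored
            continue
        sums[row[group_key]] += value
        counts[row[group_key]] += 1
    return {group: sums[group] / counts[group] for group in sorted(sums)}


# Hypothetical file/week records.
rows = [
    {"week": "A", "hitCount": 1},
    {"week": "A", "hitCount": 3},
    {"week": "B", "hitCount": 4},
    {"week": "B", "hitCount": None},
    {"week": "B", "hitCount": 4},
]
averages = average_per_group(rows, "week", "hitCount")
print(averages)
```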

What nodes or subtrees changed the most?

Rating:: Strength
Process:: To answer this question, we defined a derived attribute Minimum(hitCount) per file path / Maximum(hitCount) per file path.
After a zoom on the highest values of this attribute, we see the files with the highest increase rates of hitCount within the five snapshots.
Image:
Answer:: See image.

Did the value of attribute XYZ for this node increase or decrease? In absolute terms, or relatively to other siblings or other nodes.

Rating:: Strength
Process:: We zoomed on the file /index.html in all 5 weeks.; Then we sorted the table by week.
Image:
Answer:: The values for attribute hitCount show an increase from week B to E.

General visualization of trees: Topology

Overall characteristics: How large is the tree? How many levels deep? What is the deepest branch? Does the depth vary between subtrees or not?

Rating:: Possible
Process:

The number of currently displayed objects is always shown in the upper left corner of the table.
In our representation of the tree as a table, each animal has 19 attributes for the levels from Phylum down to Species.
For each animal some of these attributes do not have a value, i.e. the value is the empty string.
The number of non-empty values of an animal corresponds to its branch length.
We defined a derived attribute Branch Length with the following somewhat long but straightforward formula:
if([Kingdom] = "",0,1) + if([Phylum] = "",0,1) + if([Subphylum] = "",0,1) + if([Superclass] = "",0,1) +
if([Class] = "",0,1) + if([Subclass] = "",0,1) + if([Infraclass] = "",0,1) + if([Superorder] = "",0,1) +
if([Order] = "",0,1) + if([Suborder] = "",0,1) + if([Infraorder] = "",0,1) + if([Superfamily] = "",0,1) +
if([Family] = "",0,1) + if([Subfamily] = "",0,1) + if([Tribe] = "",0,1) + if([Subtribe] = "",0,1) +
if([Genus] = "",0,1) + if([Subgenus] = "",0,1) + if([Species] = "",0,1)
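
For readers without InfoZoom, the same count can be written as a loop over the rank attributes. This is a minimal sketch: the rank names are taken from the formula above, and the sample animal is a hypothetical record.

```python
# Rank attributes in the order used by the formula above.
RANKS = [
    "Kingdom", "Phylum", "Subphylum", "Superclass", "Class", "Subclass",
    "Infraclass", "Superorder", "Order", "Suborder", "Infraorder",
    "Superfamily", "Family", "Subfamily", "Tribe", "Subtribe",
    "Genus", "Subgenus", "Species",
]


def branch_length(animal):
    """Number of rank attributes with a non-empty value."""
    return sum(1 for rank in RANKS if animal.get(rank, "") != "")


# Hypothetical record: only three ranks are filled in.
animal = {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Diptera",
          "Subtribe": ""}
print(branch_length(animal))
```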

Image:: Latin classification

Common classification
Answer:

The attribute Subtribe almost never has a value. This means that this level is missing for most of the animals.
The deepest branch has 14 named levels.
We can exactly zoom on the animals with this branch depth.
For example: /Arthropoda/Hexapoda/Insecta/Pterygota/Neoptera/Hymenoptera/Apocrita/Scolioidea/Formicidae/Formicinae/Camponotini/Polyrhachis/Polyrhachis tibialis/Polyrhachis tibialis robustior
In the classification with common names most of the levels do not have a name.

Path: What is the path of this node?

Rating:: Strength
Process:: The path of an animal is given by its attribute values and also by its attribute Latin Path.
Image:
Answer:: Not applicable

Local relatives: What are the children, siblings, or cousins of this node?

Rating:: Strength
Process:: This can be directly seen.
Image:
Answer:

Brachycera and Nematocera are the children of Diptera.
Coleoptera, Hymenoptera, and Trichoptera are siblings of Diptera.
Heteroptera is a cousin of Diptera.
Some siblings and cousins are too small to read. We can look into the value menu or zoom on them to see their names.

Filtering by level: Show only the first level, or show only 3 levels down, or remove all the leaves

Rating:: Strength
Process:

Attributes can be hidden temporarily or permanently.
The projection on a subset of the attributes can be constructed as a new table.

Image:: Levels below Class are hidden; Projection on the top 5 levels
Answer:

In the first image some levels are hidden, but the number of animals remains unchanged.
The second image shows the projection onto the top levels down to Class.
There are only 112 columns because a column corresponds to a Class and not to an animal any more.
Cell sizes differ from the first image, because the width is proportional to the number of classes here.
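
The difference between hiding attributes and projecting can be sketched as follows: a projection keeps only the chosen attributes and collapses identical rows, so the number of objects drops. The rows are illustrative only.

```python
def project(rows, attributes):
    """Keep only the given attributes and collapse identical rows."""
    tuples = (tuple(row[a] for a in attributes) for row in rows)
    return list(dict.fromkeys(tuples))  # distinct rows in first-seen order


# Illustrative records, one per animal.
animals = [
    {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Diptera"},
    {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Hymenoptera"},
    {"Phylum": "Chordata", "Class": "Aves", "Order": "Passeriformes"},
]
classes = project(animals, ["Phylum", "Class"])
print(len(animals), len(classes))
```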

Topology questions that involve counting nodes can be seen as attribute-dependent questions: e.g. Which branch contains the largest number of nodes? or Which branch has the largest fan-out?

Rating:: Strength
Process:: Just look for the widest cells in each row. If you are unsure: Open the value menu and sort by frequency. Shown in the image for Order.
Image:
Answer:: The subtrees with the largest fanouts are the Phylum Arthropoda, the Subphylum Hexapoda, the Class Insecta, the Subclass Pterygota, and the Superorder Neoptera.
Within the orders Diptera wins with 20158 entries, followed by Hymenoptera with 15117 entries.
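
Sorting a value list by frequency corresponds to counting how often each value occurs. A minimal sketch with made-up counts (not the contest data):

```python
from collections import Counter

# Hypothetical Order values, one per animal.
orders = ["Diptera"] * 3 + ["Hymenoptera"] * 2 + ["Trichoptera"]

# most_common() returns (value, frequency) pairs, most frequent first.
value_list = Counter(orders).most_common()
print(value_list)
```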

General visualization of trees: Attribute based

Find nodes with high values of a numerical attribute X? (relative query)

Rating:: Strength
Process:: Switch to Overview Mode.; Select a range at the right end of the attribute's row and zoom in.; Repeat until the rightmost cell is large enough to display the value.; Alternatively, open the value list dialog and sort it backwards. The highest value is displayed on top.
Image:
Answer:: Not applicable

Find nodes with given value of a numerical attribute X? (absolute query)

Rating:: Strength
Process:: Switch to Overview Mode.; If the value can be already seen, just double-click it.; Otherwise select a rough range around the value and zoom in, possibly in several steps.; Alternatively, the value can be directly selected in the value list dialog, which might be very long, however.
Image:
Answer:: Not Applicable

Find nodes with value Y of categorical attribute X - What value of a categorical attribute occurs more often? e.g. Are there more farm animals or pets?

Rating:: Strength
Process:: In the Overview Mode the value distribution of all attributes is shown.; The width of each cell is proportional to the number of files with this value.; For cells that are too small we can look up the size in the value list dialog.; The value list can be sorted by frequency, so that the largest values/cells are on top.
Image:
Answer:: html, gif, and jpg are the most frequent file formats.

Find nodes with certain values of two or more attributes (What video file is used the most?)

Rating:: Strength
Process:: Open the value list dialog of attribute extension.; Select the video formats like avi, mpg, mov and zoom in.; Zoom on the highest values of hitCount.
Image:
Answer:: /projects/hcil/kiddesign/icdl/icdl.mpg is used the most.

Number of nodes in a tree or subtree? (How many animals? How many mammals?)

Rating:: Strength
Process:: There are several possibilities:

Mark a cell. The tool tip will show the number of animals represented by this cell.

Open the value list dialog box. The frequency of each value is displayed there.

Zoom on the subtree. The number of displayed objects is shown in the upper left corner of the table.

Image:
Answer:: The subtree of Insecta contains 64423 animals.

Comparison of branches of the tree (Subtrees with most nodes; are there more mammals or fish?)

Rating:: Strength
Process:

See question "Where are the big directories?" below.
Zoom on mammals and fish. Compare size of the two cells.

Image:

Answer:: There are more bony fishes than mammals.

Largest fanout (What is the largest group of animals with the same lineage?)

Rating:: Strength
Process:: Just look for the widest cells in each row. If you are unsure: Open the value menu and sort by frequency. Shown in the image for Order.
Image:
Answer:: The subtrees with the largest fanouts are the Phylum Arthropoda, the Subphylum Hexapoda, the Class Insecta, the Subclass Pterygota, and the Superorder Neoptera.
Within the orders Diptera wins with 20158 entries, followed by Hymenoptera with 15117 entries.

General visualization of trees: Known items

Which nodes have a particular string in their label? (Find "giraffe" in a tree of animals)

Rating:: Strength
Process:: We perform a full-text search in all attributes for "giraffe".; This zooms on all animals that contain "giraffe" in at least one of their attributes.
Image:
Answer:: There are only two animals with giraffe in their common names.

Locate a node knowing its path

Rating:: Strength
Process:: Just click onto the cells containing the next label. If the cell is too small, select a range containing the label or use the value menu.
Image:: Not Applicable
Answer:: Not Applicable

Go back to a node you have visited before

Rating:: Strength
Process:: There are several techniques:

Use the history mechanism by pressing the back button until you find the node.

For bookmarking, insert a new attribute while you are zoomed onto the node.
The visible records get a number within this new attribute, while other nodes
get an empty entry. This makes it easy to zoom onto this node again.

Insert and name a query that saves the steps that lead to this node.
This works like a macro, which can replay the steps to this node at any time.

Image:: Not Applicable
Answer:: Not Applicable

General visualization of trees: Labeling

Review all the labels in a subtree

Rating:: Strength
Process:: First we zoom on the subtree, e.g. Insecta.; For each rank we can get a popup-window with a list of all labels.
Image:
Answer:: The image shows a list of all species in the Insecta subtree.

General visualization of trees: Browsing

Explore the tree by performing a series of up and downs in the tree

Rating:: Strength
Process:: This is done by zoom-in and zoom-out operations.
Video:: Click to see video
Answer:: Not Applicable

General visualization of trees: Managing the analysis

Marking nodes of interest

Rating:: Possible
Process:: Any subset of the displayed cells can be marked (selected) using the mouse (click, drag, ctrl-click, shift-click).; However, the marking is quite volatile: The next mouse click into the table will remove it.; Another way to mark a set of records is to create a new attribute interesting and to set its values to yes or no.; This rests on the fact that InfoZoom is also a very powerful editor:; We can simply select a cell or a range of cells and directly edit its values like in a spreadsheet.; The modification is performed in all records represented by the cell.; In this way, thousands of records can be modified in a single operation.

Once the attribute interesting is defined, we can later zoom on just the interesting values.
Image:
Answer:: Not Applicable

Removing special anomalies

Rating:: Strength
Process:: InfoZoom is also a very powerful editor.; We can simply select a cell and directly edit its value like in a spreadsheet.; The modification is performed in all records represented by the cell.; In this way, thousands of records can be modified in a single operation.
Image:

Animals with different classifications in the two trees

Answer:: We experimentally cleared the selected cells in the above image.; Afterwards about 1000 animals did not have different classifications anymore.

Saving visualization settings for future reference

Rating:: Strength
Process:: The navigation history is stored as a sequence of commands.; A command sequence can be stored as a named query.; In order to perform a query later, InfoZoom executes the stored navigation commands.
Image:
Answer:: Not Applicable

Keeping the history of your analysis, reviewing it and replaying it with different parameters

Rating:: Strength
Process:: The navigation history is stored as a sequence of commands.; Using the back and forward buttons we can get an animated replay of our interaction.; The buttons also have an associated menu that shows the command history.; It can be used to jump directly to a saved state.
Image:
Answer:: Not Applicable

Phylogenies: Application specific tasks

This data set was not analyzed with InfoZoom.

Classifications: Application specific tasks

To what extent are the differences in the classifications due to differences in how animals are thought to be related? Are there other kinds of differences and can you explain them?

Rating:: Strength
Process:: There are two kinds of differences:

Some animals exist in only one of the trees

Some animals are differently classified in the two trees

These differences can be exactly determined:

Latin Name uniquely identifies an animal within each tree.

We define a derived attribute Count(Tree) per Latin Name.

The resulting value is 2 for most of the animals.

This means that they are contained in both of the trees.

But some animals are found in only one tree.

We can zoom on them by clicking on the 1.

The derived attribute Latin Path is the full path name of each animal.

It is similar to a fully qualified file name.
We define Count(Latin Path) per Latin Name and zoom on the animals where the result is 2.

These have a different classification in the two trees.

We can also use color coding in order to highlight the areas of the overall trees where there are different classifications.

To achieve this, we specify that the attribute Count(Latin Path) per Latin Name defines the coloring.

Each cell is now colored according to the average value of Count(Latin Path) per Latin Name of the animals it represents.

Therefore, red areas have fewer differences than the average, while green areas contain more differences.

We also defined several attributes like

Count(Phylum) per Latin Name
Count(Class) per Latin Name
Count(Family) per Latin Name

in order to spot the differences more precisely.
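
Both derived attributes count the distinct values of one attribute per Latin Name. The following sketch illustrates this with hypothetical placeholder animals and paths; it is an approximation of the idea, not InfoZoom's own code.

```python
from collections import defaultdict


def count_distinct_per(rows, value_key, group_key):
    """Number of distinct value_key values per group_key value."""
    distinct = defaultdict(set)
    for row in rows:
        distinct[row[group_key]].add(row[value_key])
    return {group: len(values) for group, values in distinct.items()}


# Hypothetical records; names and paths are placeholders.
rows = [
    {"Tree": "A", "Latin Name": "Species one", "Latin Path": "/P/C1/Species one"},
    {"Tree": "B", "Latin Name": "Species one", "Latin Path": "/P/C2/Species one"},
    {"Tree": "A", "Latin Name": "Species two", "Latin Path": "/P/C1/Species two"},
    {"Tree": "B", "Latin Name": "Species two", "Latin Path": "/P/C1/Species two"},
    {"Tree": "A", "Latin Name": "Species three", "Latin Path": "/P/C1/Species three"},
]

trees_per_name = count_distinct_per(rows, "Tree", "Latin Name")
paths_per_name = count_distinct_per(rows, "Latin Path", "Latin Name")

only_in_one_tree = [n for n, c in trees_per_name.items() if c == 1]
reclassified = [n for n, c in paths_per_name.items() if c == 2]
print(only_in_one_tree, reclassified)
```

An animal that occurs in both trees with the same path gets a path count of 1, so only genuinely different classifications reach the value 2.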

Image:: Animals contained in only one of the trees; Animals with a different classification in the two trees; Color Coding of the frequency of different path names
Answer:

In the first image we can see that mainly chordates are found in only one tree.
In the second image we can see that a main reason for different paths are some subclasses and infraclasses not used in tree B at all.
In the third image the green cells, mainly Chordata and especially Aves, contain many animals with two different classifications.

Using Count(Phylum) per Latin Name we detected that the 17 animals of Genus Apus even belong to two different phyla, namely Chordata in A, but Arthropoda in B!
Ensifera ensifera also belongs to two different phyla (chordates/arthropods) because it is not clear whether it is a bird or an insect.
Among others 2967 perching birds belong to different families.

Can you say in how many different subtrees a particular common name (such as "dolphin" or "horse") is used? How closely are these animals related? Are common names a good guide to understanding relationships?

Rating:: Strength
Process:: We perform a full-text search in all attributes for "dolphin".; This zooms on all animals that contain "dolphin" in at least one of their attributes.
Image:: Find-Dialog; Result of full-text search for "dolphin"; Result of full-text search for "horse"

Answer:

The search for "dolphin" returns 54 animals:
Many marine dolphins and river dolphins, but also a bird called Myzomela adolphinae and a clam called Nucula delphinodonta.
The common family names marine dolphins and river dolphins correspond to the Latin family names Delphinidae, Iniidae, and Platanistidae.
In this case the common names reveal a relationship not reflected in the Latin names.
In general, however, common names are not very useful because for most animals there is no common name at all.
The search for "horse" results in more than 1000 animals (in A and B); most of them are seahorses and horse flies.
The family horses contains only 14 animals, namely the mammals.
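
The unexpected hits are a consequence of substring matching. The sketch below assumes a case-insensitive substring search over all attributes (the exact matching rules of InfoZoom are an assumption here); the second and third records are hypothetical placeholders.

```python
def full_text_search(rows, needle):
    """Return the rows in which any attribute value contains needle."""
    needle = needle.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


rows = [
    {"Latin Name": "Myzomela adolphinae", "Common Name": ""},
    {"Latin Name": "Placeholder one", "Common Name": "river dolphin"},
    {"Latin Name": "Placeholder two", "Common Name": "seahorse"},
]
hits = full_text_search(rows, "dolphin")
print([row["Latin Name"] for row in hits])
```

The bird matches because "adolphinae" contains the string "dolphin", even though it has no common name at all.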

How many species or subspecies are named after biologists named "Townsend"?

Rating:: Strength
Process:: We perform a full-text search in Latin Name and Common Name for "townsend".
Image:
Answer:: In Tree A there are 48 Latin Names and 15 Common Names which contain "townsend".

What kind of feedback does your tool provide to alert the user quickly when a wrong name is entered?

Rating:: Strength
Process:: We perform a full-text search in all attributes for "Spirurida" and then "Spirulida".
Image:: Result of full-text search for "Spirurida"; Result of full-text search for "Spirulida"
Answer:: The first image clearly shows that the expected result was not obtained.

For the top five subtrees with the most nodes-- are they likely to have a parent of a particular rank? Or does this happen in many ranks? Can you comment on how useful "rank" is?

Rating:: Strength
Process:: We do not completely understand the question.; We try to answer it anyway.

The size of subtrees is proportional to the width of the cells.; Looking at the complete tree A we see several large cells at different levels.

Image:
Answer:

The family Formicidae is quite large, even larger than many of the classes and phyla.
The 5 largest subtrees are the Phylum Arthropoda, the Subphylum Hexapoda, the Class Insecta, the Subclass Pterygota, and the Superorder Neoptera.

File system and usage logs: Application specific tasks

Introduction

As with the animal tree, we had to transform the XML files to an object/attribute table.
Each leaf of the tree, i.e. each file, constitutes a column of the table.
Each row of the table corresponds to a file attribute.

The inner nodes are the directories. They are also represented as attributes of each file:

File path is the complete path name of a file, e.g. /class/fall2002/cmsc414/index.html.
Name is the part after the last slash, e.g. index.html.
Directory path is the part before the last slash, e.g. /class/fall2002/cmsc414.
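
The derivation of Name and Directory path from File path can be sketched as a split at the last slash. The function name and the dictionary layout are illustrative choices, not part of the original transformation.

```python
def split_file_path(file_path):
    """Split a full path at its last slash into directory path and name."""
    directory_path, _, name = file_path.rpartition("/")
    return {
        "File path": file_path,
        "Name": name,
        "Directory path": directory_path,
    }


attributes = split_file_path("/class/fall2002/cmsc414/index.html")
print(attributes)
```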

Five complete snapshots of the file system have been taken at the end of weeks A to E.
They were all combined into one large table.

In order to get a first overall impression of the whole data set, we start in InfoZoom’s Overview Mode.
Unlike in the Compressed Table Mode, the data set is not visualized as a table here.
Instead each row independently shows the value distribution of an attribute.
The size of each cell is proportional to the number of files with that value.

The following observations can be made:

There are 343,614 file/week pairs.
The snapshots from the 5 weeks have roughly the same number of files.
The most frequent name is index.html (see selected cell and edit line above the table).
html/htm and gif are the most frequent file types.
users, class, usershollings, and projects are the largest toplevel directories.
On lower levels lectures is the most frequent directory name.
More than 50% of the files were never hit; about 90% were hit fewer than 6 times.
Userid 834 is the owner of the most files.
Size and userid are missing for about 20% of the files.

Browsing is performed by interactively zooming into sub areas.

This is demonstrated in a video.


For example, we can double-click the cell containing the value A, to zoom on the first snapshot only.
The screen shot shows the result after a second zoom on projects and a switch to the Compressed Table Mode:

We can observe the following:

The projects hcil and hpsl contain the most files.
The directory jazz-chat contains mainly html-files.
hpsl contains a large directory called ppt.

Where are the big directories?

There are several different interpretations of this question:

A. Which toplevel directories contain the highest number of files?
B. Which toplevel directories occupy the most disk space?
C. Which directories directly contain the highest number of files?
D. In which directories do the directly contained files occupy the most disk space?

Interpretation A: Which toplevel directories contain the highest number of files?

Alternative 1:

Rating:: Strength
Process:: Big directories are immediately visible since the width of each cell is proportional to the number of files it represents.
Image:

Answer:

It's obvious that users and class are the biggest toplevel directories.
The size of lower level directories can be observed in the same way.

Alternative 2:

Rating:: Strength
Process:: The value list for toplevel directory can be sorted by frequency. The frequency is identical to the number of files.
Image:

Answer:

users contains 18723 files.
class contains 16978 files.

Interpretation B: Which toplevel directories occupy the most disk space?

Alternative 1:

Rating:: Strength
Process:: Define a derived attribute Sum(size in KB) per toplevel directory and exclude the unknown sizes.
Image:

Answer:

projects occupies 1.6 GB
class occupies 1.4 GB
users occupies 1.1 GB
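
A grouped sum that leaves out unknown sizes could look as follows. This is a sketch with hypothetical sizes; representing an unknown size as None is an assumption made for the example.

```python
from collections import defaultdict


def sum_per_group(rows, group_key, value_key):
    """Sum value_key per group_key, excluding rows with an unknown value."""
    totals = defaultdict(int)
    for row in rows:
        if row[value_key] is None:  # unknown size: excluded
            continue
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Hypothetical file records.
files = [
    {"toplevel directory": "projects", "size in KB": 1200},
    {"toplevel directory": "projects", "size in KB": 300},
    {"toplevel directory": "class", "size in KB": 700},
    {"toplevel directory": "usershollings", "size in KB": None},
]
totals = sum_per_group(files, "toplevel directory", "size in KB")
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

A directory whose files all have unknown sizes does not appear in the result at all, which is why a ranking by file count and a ranking by disk space can differ considerably.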

Alternative 2:

Rating:: Strength
Process:: Declare size in KB as the attribute that determines the column width of the table.; Normally, all columns have the same width, and therefore the width of each cell is proportional to the number of files it represents.; In the image below, however, the width of each cell is proportional to the total size of the files it represents.
Image:

Answer:

We can easily see that projects is the biggest toplevel directory, mainly because of the papers in pdf-format.
In /movies/monte there are some very large individual files (see large cells for the attribute file path).

Interpretation C: Which directories directly contain the highest number of files?

Rating:: Strength
Process:: Define a derived attribute Count(file path) per directory path and zoom on its highest values.
Image:

Answer:

/projects/hcil/jazz/list-archives/jazz-chat contains 950 files.
/users/building/pitr/pics/tn contains 510 files.

Interpretation D: In which directories do the directly contained files occupy the most disk space?

Rating:: Strength
Process:: Define a derived attribute Sum(size in KB) per directory path and zoom on its highest values.
Image:

Video: Click to see Video

Answer:

We can see that /movies/monte and /projects/SoftEng/ESEG/papers are the two biggest directories.

Can you see different patterns in the files? (Can you make out the difference between personal pages, class pages and research project pages?)

Rating:: Strength
Process:: Zoom on each of the four largest toplevel directories.
Image:: Toplevel directories; Toplevel directory users; Toplevel directory projects; Toplevel directory class

Answer:

users, projects, class and usershollings are the biggest toplevel directories.
/users/hollings seems to be very similar to usershollings.
Most other user files can also be found in two places.
For the files in the /users<name> directories hitCount is 0 and there is no information about file size and creation/modification time.
hcil and hpsl are the projects with the most files, mainly because of the large directories /projects/hcil/jazz/list-archives/jazz-chat and /projects/hpsl/classes/818s-s98/ppt/titan.

Were there a lot of pages created recently? If so, in which part of the file system?

Rating:: Strength
Process:: We used the creation and modification times of the files to answer this question.; First of all we defined the derived attribute mtime >= ctime. Somewhat surprisingly, the result was false in most cases.; On the other hand, ctime >= mtime is true in 99.9% of the cases. (There are 59 exceptions!?); So obviously the two attributes ctime and mtime had been swapped by mistake in the data set.
We corrected this error and defined a few attributes derived from ctime and mtime.; It can be seen that the vast majority of the files were created and modified before the first snapshot.; Most files were even modified on the same day in August 2002!; In order to answer the question we zoomed on all files with a creation day of 2003/1/25 or later in snapshot E.
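
The plausibility check and the correction can be sketched as follows. The dates are hypothetical, and the functions only illustrate the reasoning (a file should not be modified before it is created); they are not the InfoZoom steps themselves.

```python
from datetime import date


def share_mtime_not_before_ctime(rows):
    """Fraction of rows in which mtime >= ctime holds."""
    return sum(1 for row in rows if row["mtime"] >= row["ctime"]) / len(rows)


def swap_times(rows):
    """Return copies of the rows with ctime and mtime exchanged."""
    return [{**row, "ctime": row["mtime"], "mtime": row["ctime"]} for row in rows]


# Hypothetical records as delivered: most have ctime later than mtime.
rows = [
    {"ctime": date(2002, 8, 31), "mtime": date(2002, 8, 1)},
    {"ctime": date(2003, 1, 25), "mtime": date(2002, 12, 1)},
    {"ctime": date(2002, 9, 10), "mtime": date(2002, 9, 1)},
    {"ctime": date(2002, 8, 1), "mtime": date(2002, 8, 5)},
]
before = share_mtime_not_before_ctime(rows)
after = share_mtime_not_before_ctime(swap_times(rows))
print(before, after)
```

If the share is low before the swap and high afterwards, the two attributes were most likely exchanged; the few remaining exceptions stay unexplained, as in the real data.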

Image:
Answer:

1,435 files have been created after 2003/1/24.
In the class directory many new files were created, mainly in the subdirectory spring2003.
In Library many new bib files were created.
In projects/hcil there are also a lot of new files.

Are the newer directories bigger than the older projects?

We compared the number of files in the subdirectories of projects in weeks A and E.

Rating:: Strength
Process:: We defined a new attribute Count(file path) per level2 and week.; Next we zoomed on the weeks A and E and then on projects.
Image:
Answer:: The image shows the three biggest directories. None of them grew between A and E.
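
Count(file path) per level2 and week groups by two attributes at once. A minimal sketch with hypothetical records, comparing weeks A and E:

```python
from collections import Counter

# Hypothetical (week, level2 directory, file path) records below /projects.
rows = [
    ("A", "hcil", "/projects/hcil/a.html"),
    ("A", "hcil", "/projects/hcil/b.html"),
    ("A", "hpsl", "/projects/hpsl/a.html"),
    ("E", "hcil", "/projects/hcil/a.html"),
    ("E", "hcil", "/projects/hcil/b.html"),
    ("E", "hpsl", "/projects/hpsl/a.html"),
    ("E", "hpsl", "/projects/hpsl/c.html"),
]

# One counter entry per (level2, week) combination.
counts = Counter((level2, week) for week, level2, _ in rows)

# Growth in number of files between the first and the last snapshot.
directories = sorted({level2 for _, level2, _ in rows})
growth = {d: counts[(d, "E")] - counts[(d, "A")] for d in directories}
print(growth)
```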

When was the page giving directions to the department last updated?

Rating:: Strength
Process:: A full-text search for "directions" shows a few files containing that string in their names.
Clicking on the toplevel directory department shows the result.
Image:
Answer:: The file /department/directions.shtml was last modified on 2002/8/31, 03:43:19

Which are the popular webpages?

Alternative 1:

Rating:: Strength
Process:: In order to find the most popular files we simply zoom on the highest values of the attribute hitCount.
Image:: Click to see video
Answer:: It turns out that /index.html and /index.shtml have the most hits, which is not very surprising.

Alternative 2:

Rating:: Strength
Process:: A more interesting question is to find the most popular pdf-files.; To accomplish this, we simply double-click pdf before zooming on the highest values of hitCount.
Image:: Click to see video
Answer:

/users/chiraz/thesis.pdf had 127 hits in week A.
/class/spring2002/cmsc818m/doc/0220/expanding.pdf had 210 hits.
/Library/TRs/CS-TR-4405/CS-TR-4405.pdf had 251 hits.

Alternative 3:

Rating:: Strength
Process:: We can also use an attribute-dependent column width here.
In the image below, the width of each cell is proportional to the hitCount.
So large cells represent popular web pages.
Moreover, we have defined a derived attribute Sum(hitCount) per toplevel directory and we have sorted the table by this new attribute.
Image:
Answer:: We can observe that projects is the most popular top level directory and hcil is the most popular project, but mainly because of the banner-images in gif-format.

Are there some labs more popular than others?

Rating:: Strength
Process:: Zoom on projects and define Sum(hitCount) per level 2.
Image:
Answer:: hcil, plus, and hpsl are the most popular labs.

Which areas are getting more popular? Less popular?

Rating:: Strength
Process:: We define the attribute hitCount as the color giving attribute (indicated by the traffic lights left of its attribute name).
Then we zoom into weeks A and E.
This shows cells with more hits than the average in green and cells with fewer hits than the average value in red.
Comparing weeks A and E the change of color can indicate that the number of hits increased or decreased.
Image:
Answer:: The toplevel directory class gets about three times more hits in week E than in week A (from 2.11 to 6.46).
Directory users decreased from 3.06 to 2.87 on average.

Are new pages more popular than old pages?

Rating:: Strength
Process:: We defined a new attribute creation year derived from creation day,
and another derived attribute Average(hitCount) per creation year:
Image:
Answer:: It turned out that the average hit count in general is lower for old files. The year 2000 is an exception. Another exception is the group of 3 files created in 1993, which have an average hit count of about 60, mainly because the file /users/samir/khuller.gif was hit 129 times.

Which old pages are popular?

Rating:: Strength
Process:: We zoomed on the files created in 1999 or before and on the highest hit counts.
This showed some banner images in GIF-format.; It is more interesting to look for popular papers. Therefore we focused on ps- and pdf-files.; Moreover, we defined the derived attribute Sum(hitCount) per file path to sum up the hit counts of the 5 weeks for each file.
Image:
Answer:: See image.

What proportion of the pages are never used?

Rating:: Strength
Process:: The distribution of hit counts can be directly seen.
Image:
Answer:: About 50% of the pages are never used.

What proportion of the pages are seldom used?

Rating:: Strength
Process:: The distribution of hit counts can be directly seen.
Image:
Answer:: About 90% of the pages are used 5 times or fewer per week.